Browsing and Retrieval of Full Broadcast-Quality Video

Authors

  • David Gibbon
  • Andrea Basso
  • Reha Civanlar
  • Qian Huang
  • Esther Levin
  • Roberto Pieraccini
Abstract

In this paper we describe a system we have developed for automatic broadcast-quality video indexing that successfully combines results from the fields of speaker verification, acoustic analysis, very large vocabulary speech recognition, content-based sampling of video, information retrieval, natural language processing, dialogue systems, and MPEG2 delivery over IP. Our audio classification and anchorperson detection (in the case of news material) classifies video into news versus commercials using acoustic features and reaches 97% accuracy on our test data set. The processing includes very large vocabulary speech recognition (over a 230K-word vocabulary) for synchronizing the closed caption stream with the audio stream. Broadcast news corpora are used to generate language models and acoustic models for speaker identification. Compared with conventional discourse segmentation algorithms based on text information alone, our integrated method operates more efficiently and produces more accurate results (> 90%) on a test database of 17 half-hour broadcast news programs. We have developed a natural language dialogue system for navigating large multimedia databases and tested it on our database of over 4000 hours of television broadcast material. Story rendering and browsing techniques are employed once the user has restricted the search to a small subset of the database that can be efficiently represented in a few video screens. We focus on the advanced home television as the target appliance and describe a flexible rendering engine that maps the user-selected story data through application-specific templates to generate suitable user interfaces. Error-resilient IP/RTP/RTSP MPEG-2 media control and streaming is included in the system to allow the user to view the selected video material.

Introduction

Recent technical advances are enabling a new class of consumer applications involving browsing and retrieval of full broadcast-quality video.
Cable modems are bringing megabit bandwidth to the home. Set top boxes include low cost MPEG2 decoders and can also include an HTTP client for web browsing. Disk storage technology is riding a Moore’s law curve and is currently at a dollar-per-megabyte point that makes large digital video archives practical. Selective delivery of digital video over IP is less well established than these other technologies, but rapid progress is being made in this area. Systems that build upon these component technologies in novel ways can bring new services that meet consumers’ needs. In this paper we present an example system for selective retrieval of broadcast television news with full broadcast quality. Video indexing is one of the areas in which further technology development is needed. To build systems for video database retrieval, we need standard storage and communication protocols at several levels for handling the video program attributes, key frames, and associated text streams such as the closed caption, in addition to the usual issues associated with storage and delivery of the bulk media. (Video program attributes include such things as the program title, copyright, etc.) Currently, international standards bodies are active in this area. In particular, MPEG7 aims to address this as a natural evolution of the MPEG video standards of the past [MPG7]. The IETF is also working on standards focused in the areas where television and the Internet intersect [Hoschka98, Kate98a, Kate98b]. Meanwhile the industry is moving ahead with implementations that may become de facto standards in their own right. For example, at the system level we have Microsoft ASF [Fleischman98] and Real Networks [Real98], and at the application level there are Virage, ISLIP, and others [Virage98, Islip98, FasTV98, Magnifi98, Excalibur98].
When these indexing components become stable, we will have all the building blocks necessary to create systems for browsing video databases that have been manually created. The major broadcasters will likely begin to generate indexes of each video program as part of the normal production process. However, for a significant time, smaller broadcasters will not have the necessary resources to do this. Further, it will be too costly to manually index the many large (several hundred thousand hour) video news archives that exist. An automated video indexing system is required for these applications. Several such systems have been proposed (for example, at CMU and AT&T [Wactlar98, Shahraray95b]). A common aspect of the successful systems is true multimedia processing in which state-of-the-art techniques from several disciplines are employed. In fact, in many cases it is necessary to extend existing methods in the process of applying them to the domain of video indexing. In this paper we describe the successful combination of results from the fields of speaker verification, acoustic analysis, very large vocabulary speech recognition, content-based sampling of video, information retrieval, natural language processing, dialogue systems, and MPEG2 delivery over IP networks. The display device itself imposes an additional challenge for broadcast-quality video indexing systems intended for consumers. Typical home PCs are not well designed for the display of interlaced video, and the comparatively low resolution of television sets is not capable of rendering the browsing and retrieval user interfaces (UIs) of the aforementioned systems. We will focus on the home television as the display device. Much of the prior work assumes a desktop environment in which high-resolution graphics are available for the UI, and “postage stamp” quality video is sufficient.
In these systems, QCIF resolution video serves the purpose of identifying a particular video clip in a database and of verifying that it is relevant to a given query. This quality is not sufficient for actually watching the video in the way to which a television viewer is accustomed. There are several issues to be addressed when designing user interfaces for television-resolution applications. Even though the addressable resolution is approximately 640x480 for NTSC displays, the usable resolution is further reduced by overscan and interlacing. Fonts must be large, and preferably smoothed to avoid interlacing artifacts. Also, usage paradigms dictate that scrollbars are to be avoided [WebTV98]. In summary, prior systems employ high-resolution UIs to access low-resolution video; we are concerned here with low-resolution UIs for accessing high-resolution video.

Media Processing

To facilitate browsing of large collections of video material on limited-resolution terminals, a successful system must organize the data in a structured manner and present this structure through natural user interfaces. The process of populating the collection must be automated as much as possible. At the highest level, video programs have attributes such as “network,” “title,” and “genre.” Any finer-grained classification or segmentation will also aid the searching and browsing process. For the case of broadcast news video, we can automatically segment video into semantically meaningful units such as stories and summaries, and derive a content hierarchy [Huang99b]. For the remainder of this section we will focus on broadcast news content, but much of the processing can be applied to other types of content as well (e.g. commercial/program segmentation, closed caption synchronization). Please refer to Figure 1 for an overview of the media processing. We assume that we are starting with elementary MPEG2 audio and video streams including a closed caption data stream.
In a later section we describe the preprocessing required for streaming the media. The content-based sampling and representative image selection processes are covered in the section on “story rendering for browsing.”

Audio Classification and Anchorperson Detection

Our method first classifies video into news versus commercials using acoustic features. Current algorithms reach 97% accuracy on our test data set. Within the news segments, we further identify smaller segments containing the anchorperson. Gaussian mixture model based anchor and background models are trained using 12 mel cepstral and 12 delta cepstral features and then used to compute likelihood values estimated from the audio signal [Rosenberg98]. With these techniques, we have achieved a 92% correct segment identification rate, with a low false alarm rate, on our test data set [Huang99a]. By further integrating audio characteristics with visual information in recognizing the anchor, the detection precision on detected segments is significantly improved. Such recognized anchor segments serve two purposes. First, they hypothesize the news story boundaries, because every news story starts with the anchor’s introduction (though the converse is not true).

Story Segmentation

The recognized anchor speech segments divide the synchronized (see below) closed caption stream into blocks of text, to which text analysis algorithms are applied to extract individual news stories. Compared with conventional discourse segmentation algorithms based on text information alone, our integrated method operates more efficiently and produces more accurate results (> 90%) on a test database of 17 half-hour broadcast news programs. The second purpose in identifying anchor segments is to extract condensed news summaries or introductions spoken by the anchor, which are separate from the body of the story.
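As an illustration, the Gaussian-mixture anchor detection described above amounts to scoring a segment's cepstral frames under the anchor and background models and thresholding the average log-likelihood ratio. The following is a minimal sketch, not the paper's implementation: the model parameters are toy values rather than mixtures trained on broadcast audio, and 2-D vectors stand in for the 24 mel cepstral and delta cepstral features.

```python
import math

def log_gauss(frame, mean, var):
    # log N(frame; mean, diag(var)) for one cepstral feature vector
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))

def gmm_loglik(frame, gmm):
    # gmm is a list of (weight, mean, var); log-sum-exp over components
    terms = [math.log(w) + log_gauss(frame, mu, var) for w, mu, var in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_segment(frames, anchor_gmm, background_gmm, threshold=0.0):
    # average per-frame log-likelihood ratio decides anchor vs. background
    llr = sum(gmm_loglik(f, anchor_gmm) - gmm_loglik(f, background_gmm)
              for f in frames) / len(frames)
    return "anchor" if llr > threshold else "background"

# Toy 2-D models standing in for trained 24-feature mixtures.
anchor_gmm = [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [0.5, -0.5], [1.0, 1.0])]
background_gmm = [(1.0, [3.0, 3.0], [2.0, 2.0])]
frames = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
print(classify_segment(frames, anchor_gmm, background_gmm))  # anchor
```

The same likelihood-ratio machinery, with different model pairs, serves the news-versus-commercial classification step.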
By playing back only the news summary, a user can experience “glancing at headline stories,” similar to reading only the front page of a newspaper, before deciding which headline story to browse further.

Caption Synchronization

Real-time closed captioning lags behind the audio and video by a variable amount of time, from about 1 to 10 seconds. Since the story segmentation uses all three media streams, errors can be introduced if the caption is not aligned with the other media streams. We have had success using a large vocabulary speech recognizer to generate very accurate word timestamps (although the word error rate is considerable). The system was built using broadcast news corpora and has a lexicon of over 230K words [Choi99]. The recognizer runs at 5 times real time on a 400 MHz NT machine using the AT&T Watson engine, with about a 40% word error rate. After recognition, a parallel text alignment is performed to import the more accurate timing information from the automatic speech transcription into the more accurate text of the closed caption (the method is similar to that described in [Gibbon98]).

Figure 1: Media processing to determine story segmentation and representative images and terms

Multimedia Database

The story segmentation, keywords, and story image descriptors are stored in a multimedia database along with the full closed caption text and the audio and video media. We have built upon our prior experience with a 4000+ hour heterogeneous multimedia database [Shahraray97]. The database is composed primarily of broadcast television news programs from NBC, CNN, and PBS over the period from 1994 through 1999. Each database entry includes closed caption text and still frames, and may also include 16Kbps audio, 128Kbps video, or 5Mbps MPEG2 video. This large number of items proved useful in designing the dialogue interface described below.
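The parallel text alignment used for caption synchronization can be sketched with a standard sequence matcher: align the noisy-but-accurately-timed recognizer output against the accurate-but-delayed caption text, then copy timestamps across matched words. This is only an illustrative sketch under simple assumptions — `difflib` stands in for the alignment method of [Gibbon98], and unmatched caption words are left untimed where a real system would interpolate.

```python
import difflib

def normalize(word):
    # case-fold and strip punctuation so "Evening." matches "evening"
    return word.lower().strip(".,:;!?\"'")

def transfer_timestamps(asr_words, caption_words):
    """asr_words: list of (word, seconds) pairs from the recognizer.
    caption_words: the more accurate closed-caption text.
    Returns (caption_word, seconds_or_None) pairs."""
    a = [normalize(w) for w, _ in asr_words]
    b = [normalize(w) for w in caption_words]
    times = [None] * len(caption_words)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for i, j, n in matcher.get_matching_blocks():
        for k in range(n):
            # matched word: take its timestamp from the ASR stream
            times[j + k] = asr_words[i + k][1]
    return list(zip(caption_words, times))

asr = [("good", 1.2), ("evening", 1.5), ("markets", 3.0), ("fell", 3.4)]
caption = ["Good", "evening.", "Stock", "markets", "fell", "sharply."]
print(transfer_timestamps(asr, caption))
```

Even with a 40% word error rate, enough words survive recognition intact to pin down the caption's timing at story granularity.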
Through a navigation process described below, users select a particular story for display. A rendering engine takes the story data and maps it through templates to generate user interfaces. This template-based Video


Related papers

Semantic Retrieval of Video

In this article we will review different research works in 3 types of video, i.e., video of meetings, movies and broadcast news, and sports video. We will then put them into a general framework of video summarization, browsing, and retrieval. We will also review different video representation techniques for these three types of video content within this general framework. At last we will presen...


Content-based search and browsing in semantic multimedia retrieval

Growth in storage capacity has led to large digital video repositories and complicated the discovery of specific information without the laborious manual annotation of data. The research focuses on creating a retrieval system that is ultimately independent of manual work. To retrieve relevant content, the semantic gap between the searcher's information need and the content data has to be overco...


Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite

The challenge facing the indexing of digital video information in order to support browsing and retrieval by users, is to design systems that can accurately and automatically process large amounts of heterogeneous video. The segmentation of video material into shots and scenes is the basic operation in the analysis of video content. This paper presents a detailed evaluation of a histogram-based...


Examining User Interactions with Video Retrieval Systems

The Informedia group at Carnegie Mellon University has since 1994 been developing and evaluating surrogates, summary interfaces, and visualizations for accessing digital video collections containing thousands of documents, millions of shots, and terabytes of data. This paper reports on TRECVID 2005 and 2006 interactive search tasks conducted with the Informedia system by users having no knowled...


Multimedia surrogates for video gisting: Toward combining spoken words and imagery

Good surrogates that allow people to quickly derive the gist of videos without taking the time to view the full video are crucial to video retrieval and browsing systems. Although there are many kinds of textual and visual surrogates used in video retrieval systems, there are few audio surrogates in practice. To evaluate the effectiveness of audio surrogates alone and in combination with one ki...




Publication date: 1999